Automatic Extraction of ICD-O-3 Primary Sites from Cancer Pathology Reports

Authors

  • Ramakanth Kavuluru
  • Isaac Hands
  • Eric B. Durbin
  • Lisa Witt
Abstract

Although registry-specific requirements exist, cancer registries primarily identify reportable cases using a combination of particular ICD-O-3 topography and morphology codes assigned to cancer case abstracts, of which free-text pathology reports form a main component. The codes are generally extracted from pathology reports by trained human coders, sometimes with the help of software programs. Here we present results that improve on the state of the art in automatic extraction of 57 generic sites from pathology reports, using three representative machine learning algorithms for text classification. We use a dataset of 56,426 reports from 35 labs that report to the Kentucky Cancer Registry. Employing unigrams, bigrams, and named entities as features, our methods achieve a class-based micro F-score of 0.9 and a macro F-score of 0.72. To our knowledge, this is the best result on extracting ICD-O-3 codes from pathology reports when the number of possible codes is large. Because our dataset is large compared to those of similar efforts and draws on reports from 35 different labs, we also expect our final models to generalize better when extracting primary sites from previously unseen reports.
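The gap between the micro and macro F-scores reported above is typical when some classes are rare: micro-averaging pools the counts over all reports, while macro-averaging gives every site equal weight. The following is a minimal sketch of the two averages in plain Python. The function name `f_scores` and the toy labels are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter

def f_scores(gold, pred):
    """Return (micro_f1, macro_f1) for single-label, multi-class predictions."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # predicted class p gains a false positive
            fn[g] += 1          # true class g gains a false negative

    def f1(t, false_pos, false_neg):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    # Micro: pool counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Toy example (hypothetical labels, not taken from the paper's dataset):
# one frequent site and two rare ones, with one rare site always missed.
gold = ["C50"] * 8 + ["C34"] + ["C18"]
pred = ["C50"] * 8 + ["C50"] + ["C18"]
micro, macro = f_scores(gold, pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the single missed rare class barely moves the micro score but pulls the macro score down sharply, which is the same qualitative pattern as the 0.9 versus 0.72 figures in the abstract.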


Similar resources

Changes in cancer registry coding for lymphoma subtypes: reliability over time and relevance for surveillance and study.

Because lymphoma comprises numerous histologic subtypes, understanding the reasons for ongoing increases in its incidence requires surveillance and etiologic study of these subtypes. However, this research has been hindered by many coexisting classification schemes. The Revised European American classification of Lymphoid Neoplasms (REAL)/WHO system developed in 1994 and now used in clinical se...


Assessing the Utility of Automatic Cancer Registry Notifications Data Extraction from Free-Text Pathology Reports

Cancer Registries record cancer data by reading and interpreting pathology cancer specimen reports. For some Registries this can be a manual process, which is labour and time intensive and subject to errors. A system for automatic extraction of cancer data from HL7 electronic free-text pathology reports has been proposed to improve the workflow efficiency of the Cancer Registry. The system is c...


Fractal Study on Nuclear Boundary of Cancer Cells in Urinary Smears

Background & Objectives: Cancer is a serious problem for human beings and is becoming more serious day by day. A prerequisite for any therapeutic modality is early diagnosis. Automated cancer diagnosis by automatic image feature extraction procedures can be used as a feature extraction in the field of fractal dimension. The aim of this survey was to introduce a quantitative and objective...


Development of Query Strategies to Identify a Histologic Lymphoma Subtype in a Large Linked Database System

BACKGROUND Large linked databases (LLDB) represent a novel resource for cancer outcomes research. However, accurate means of identifying a patient population of interest within these LLDBs can be challenging. Our research group developed a fully integrated platform that provides a means of combining independent legacy databases into a single cancer-focused LLDB system. We compared the sensitivi...




Volume: 2013

Publication date: 2013